2,641 results for "Han, Kai"
Search Results
2. Kangaroo: Lossless Self-Speculative Decoding via Double Early Exiting
- Author
-
Liu, Fangcheng, Tang, Yehui, Liu, Zhenhua, Ni, Yunsheng, Han, Kai, and Wang, Yunhe
- Subjects
Computer Science - Computation and Language, Computer Science - Machine Learning - Abstract
Speculative decoding has demonstrated its effectiveness in accelerating the inference of large language models while maintaining a consistent sampling distribution. However, the conventional approach of training a separate draft model to achieve a satisfactory token acceptance rate can be costly. Drawing inspiration from early exiting, we propose a novel self-speculative decoding framework \emph{Kangaroo}, which uses a fixed shallow sub-network as a self-draft model, with the remaining layers serving as the larger target model. We train a lightweight and efficient adapter module on top of the sub-network to bridge the gap between the sub-network and the full model's representation ability. It is noteworthy that the inference latency of the self-draft model may no longer be negligible compared to the large model, necessitating strategies to increase the token acceptance rate while minimizing the drafting steps of the small model. To address this challenge, we introduce an additional early exiting mechanism for generating draft tokens. Specifically, we halt the small model's subsequent prediction during the drafting phase once the confidence level for the current token falls below a certain threshold. Extensive experiments on the Spec-Bench demonstrate the effectiveness of Kangaroo. Under single-sequence verification, Kangaroo achieves speedups up to $1.68\times$ on Spec-Bench, outperforming Medusa-1 with 88.7\% fewer additional parameters (67M compared to 591M). The code for Kangaroo is available at https://github.com/Equationliu/Kangaroo.
- Published
- 2024
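The early-exit rule Kangaroo adds during drafting reduces to a confidence check inside the draft loop. Below is a minimal PyTorch sketch of that idea, assuming a `draft_model` callable that maps a (1, seq) token tensor to (1, seq, vocab) logits and greedy drafting; the adapter module, sampling, and the parallel verification step are omitted, and the threshold value is illustrative rather than Kangaroo's actual setting.

```python
import torch

def draft_with_early_exit(draft_model, tokens, max_draft_steps=8, conf_threshold=0.6):
    # Greedy drafting loop that halts once the draft model's top-1
    # confidence drops below the threshold (the early-exit rule from
    # the abstract). `draft_model` is an assumed interface mapping a
    # (1, seq) token tensor to (1, seq, vocab) next-token logits.
    drafted = []
    for _ in range(max_draft_steps):
        logits = draft_model(tokens)[:, -1, :]       # logits for the next position
        probs = torch.softmax(logits, dim=-1)
        conf, next_token = probs.max(dim=-1)         # top-1 probability and token
        if conf.item() < conf_threshold:             # early exit: stop drafting here
            break
        drafted.append(next_token.item())
        tokens = torch.cat([tokens, next_token.view(1, 1)], dim=1)
    return drafted  # candidate tokens for the large model to verify in parallel
```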
3. GhostNetV3: Exploring the Training Strategies for Compact Models
- Author
-
Liu, Zhenhua, Hao, Zhiwei, Han, Kai, Tang, Yehui, and Wang, Yunhe
- Subjects
Computer Science - Computer Vision and Pattern Recognition - Abstract
Compact neural networks are specially designed for applications on edge devices, offering faster inference speed yet modest performance. However, the training strategies of compact models are currently borrowed from those of conventional models, which ignores the difference in model capacity and may thus impede the performance of compact models. In this paper, by systematically investigating the impact of different training ingredients, we introduce a strong training strategy for compact models. We find that appropriate designs of re-parameterization and knowledge distillation are crucial for training high-performance compact models, while some commonly used data augmentations for training conventional models, such as Mixup and CutMix, lead to worse performance. Our experiments on the ImageNet-1K dataset demonstrate that our specialized training strategy for compact models is applicable to various architectures, including GhostNetV2, MobileNetV2 and ShuffleNetV2. Specifically, equipped with our strategy, GhostNetV3 1.3$\times$ achieves a top-1 accuracy of 79.1% with only 269M FLOPs and a latency of 14.46ms on mobile devices, surpassing its ordinarily trained counterpart by a large margin. Moreover, our observations can also be extended to object detection scenarios. PyTorch code and checkpoints can be found at https://github.com/huawei-noah/Efficient-AI-Backbones/tree/master/ghostnetv3_pytorch.
- Published
- 2024
4. SPTNet: An Efficient Alternative Framework for Generalized Category Discovery with Spatial Prompt Tuning
- Author
-
Wang, Hongjun, Vaze, Sagar, and Han, Kai
- Subjects
Computer Science - Computer Vision and Pattern Recognition, Computer Science - Artificial Intelligence - Abstract
Generalized Category Discovery (GCD) aims to classify unlabelled images from both `seen' and `unseen' classes by transferring knowledge from a set of labelled `seen' class images. A key theme in existing GCD approaches is adapting large-scale pre-trained models for the GCD task. An alternate perspective, however, is to adapt the data representation itself for better alignment with the pre-trained model. As such, in this paper, we introduce a two-stage adaptation approach termed SPTNet, which iteratively optimizes model parameters (i.e., model-finetuning) and data parameters (i.e., prompt learning). Furthermore, we propose a novel spatial prompt tuning method (SPT) which considers the spatial property of image data, enabling the method to better focus on object parts, which can transfer between seen and unseen classes. We thoroughly evaluate our SPTNet on standard benchmarks and demonstrate that our method outperforms existing GCD methods. Notably, we find our method achieves an average accuracy of 61.4% on the SSB, surpassing prior state-of-the-art methods by approximately 10%. The improvement is particularly remarkable as our method yields extra parameters amounting to only 0.117% of those in the backbone architecture. Project page: https://visual-ai.github.io/sptnet., Comment: Accepted as a conference paper at ICLR 2024; Project page: https://visual-ai.github.io/sptnet
- Published
- 2024
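SPTNet's two-stage idea of optimizing "data parameters" alongside model parameters can be illustrated with a generic learnable visual prompt. The sketch below simply adds a learnable per-pixel prompt to the input image; the actual SPT method uses spatially structured, patch-aligned prompts, so treat the parameterization here as a simplified stand-in.

```python
import torch
import torch.nn as nn

class SpatialPrompt(nn.Module):
    # A learnable "data parameter" added to the input image, in the
    # spirit of the abstract's prompt-learning stage. Image size and
    # the per-pixel form are illustrative assumptions, not SPT's design.
    def __init__(self, image_size=224, channels=3):
        super().__init__()
        self.prompt = nn.Parameter(torch.zeros(1, channels, image_size, image_size))

    def forward(self, images):
        return images + self.prompt  # adapt the data representation itself
```

In the alternating scheme the abstract describes, one stage would update only `self.prompt` with the backbone frozen, while the other stage fine-tunes the model parameters.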
5. A robust audio deepfake detection system via multi-view feature
- Author
-
Yang, Yujie, Qin, Haochen, Zhou, Hang, Wang, Chengcheng, Guo, Tianyu, Han, Kai, and Wang, Yunhe
- Subjects
Computer Science - Sound, Electrical Engineering and Systems Science - Audio and Speech Processing - Abstract
With the advancement of generative modeling techniques, synthetic human speech is becoming increasingly indistinguishable from real speech, posing tricky challenges for audio deepfake detection (ADD) systems. In this paper, we exploit audio features to improve the generalizability of ADD systems. We investigate ADD task performance over a broad range of audio features, including various handcrafted features and learning-based features. Experiments show that learning-based audio features pretrained on a large amount of data generalize better than handcrafted features in out-of-domain scenarios. Subsequently, we further improve the generalizability of the ADD system using the proposed multi-feature approaches to incorporate complementary information from features of different views. The model trained on ASV2019 data achieves an equal error rate of 24.27\% on the In-the-Wild dataset., Comment: 5 pages, 2 figures
- Published
- 2024
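The multi-view fusion described above amounts to combining complementary feature types before classification. A minimal sketch, assuming simple concatenation fusion and placeholder feature dimensions (the paper's extractors and fusion strategy may differ):

```python
import torch
import torch.nn as nn

class MultiViewADD(nn.Module):
    # Fuses features from different "views" (e.g., a handcrafted spectral
    # feature and a pretrained speech embedding) before a binary
    # real/fake classifier. Dimensions and layers are illustrative.
    def __init__(self, dims=(60, 768), hidden=256):
        super().__init__()
        self.classifier = nn.Sequential(
            nn.Linear(sum(dims), hidden), nn.ReLU(), nn.Linear(hidden, 2))

    def forward(self, views):
        fused = torch.cat(views, dim=-1)  # concatenation fusion across views
        return self.classifier(fused)
```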
6. Sample Selection Based on Uncertainty for Combating Label Noise
- Author
-
Hao, Shuohui, Liu, Zhe, Song, Yuqing, Liu, Yi, Han, Kai, Sheng, Victor S., and Zhu, Yan
- Published
- 2023
7. A Domain Knowledge-Based Semi-supervised Pancreas Segmentation Approach
- Author
-
Ma, Siqi, Liu, Zhe, Song, Yuqing, Liu, Yi, Han, Kai, and Jiang, Yang
- Published
- 2023
8. DenseMamba: State Space Models with Dense Hidden Connection for Efficient Large Language Models
- Author
-
He, Wei, Han, Kai, Tang, Yehui, Wang, Chengcheng, Yang, Yujie, Guo, Tianyu, and Wang, Yunhe
- Subjects
Computer Science - Computation and Language, Computer Science - Machine Learning - Abstract
Large language models (LLMs) face a daunting challenge due to the excessive computational and memory requirements of the commonly used Transformer architecture. While state space models (SSMs) are a new type of foundational network architecture offering lower computational complexity, their performance has yet to fully rival that of Transformers. This paper introduces DenseSSM, a novel approach to enhance the flow of hidden information between layers in SSMs. By selectively integrating shallow-layer hidden states into deeper layers, DenseSSM retains fine-grained information crucial for the final output, while preserving the training parallelizability and inference efficiency of the original architecture. The proposed method is widely applicable to various SSM types such as RetNet and Mamba. With similar model size, DenseSSM achieves significant improvements, exemplified by DenseRetNet outperforming the original RetNet with up to 5% accuracy improvement on public benchmarks. Code is available at https://github.com/WailordHe/DenseSSM
- Published
- 2024
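A minimal sketch of the dense hidden connection idea: hidden states from shallower layers are projected and fused into the current layer's input before the SSM block runs. The projected-sum fusion and the `nn.Linear` stand-in for a real RetNet/Mamba block are assumptions for illustration, not the paper's exact operator.

```python
import torch
import torch.nn as nn

class DenseSSMBlock(nn.Module):
    # One layer of an SSM-style stack where earlier layers' hidden
    # states are projected and added to this layer's input ("dense
    # hidden connection"). `ssm_layer` is a placeholder for a real
    # RetNet/Mamba block.
    def __init__(self, dim, n_prev):
        super().__init__()
        self.ssm_layer = nn.Linear(dim, dim)  # stand-in for an SSM block
        self.proj = nn.ModuleList([nn.Linear(dim, dim) for _ in range(n_prev)])

    def forward(self, x, prev_hiddens):
        # selectively integrate shallow-layer hidden states into this layer
        for p, h in zip(self.proj, prev_hiddens):
            x = x + p(h)
        return self.ssm_layer(x)
```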
9. SAM-DiffSR: Structure-Modulated Diffusion Model for Image Super-Resolution
- Author
-
Wang, Chengcheng, Hao, Zhiwei, Tang, Yehui, Guo, Jianyuan, Yang, Yujie, Han, Kai, and Wang, Yunhe
- Subjects
Computer Science - Computer Vision and Pattern Recognition - Abstract
Diffusion-based super-resolution (SR) models have recently garnered significant attention due to their potent restoration capabilities. However, conventional diffusion models perform noise sampling from a single distribution, constraining their ability to handle real-world scenes and complex textures across semantic regions. With the success of the Segment Anything Model (SAM), generating sufficiently fine-grained region masks can enhance the detail recovery of diffusion-based SR models. However, directly integrating SAM into SR models results in much higher computational cost. In this paper, we propose the SAM-DiffSR model, which utilizes the fine-grained structure information from SAM when sampling noise to improve image quality without additional computational cost during inference. During training, we encode structural position information into the segmentation mask from SAM. The encoded mask is then integrated into the forward diffusion process by modulating the sampled noise with it. This adjustment allows us to independently adapt the noise mean within each corresponding segmentation area. The diffusion model is trained to estimate this modulated noise. Crucially, our proposed framework does NOT change the reverse diffusion process and does NOT require SAM at inference. Experimental results demonstrate the effectiveness of our proposed method, showcasing superior performance in suppressing artifacts and surpassing existing diffusion-based methods by up to 0.74 dB in PSNR on the DIV2K dataset. The code and dataset are available at https://github.com/lose4578/SAM-DiffSR.
- Published
- 2024
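The training-time change the abstract describes, shifting the noise mean per segmentation region while leaving the reverse process untouched, can be sketched as a small modification of the standard forward diffusion step. `mask_embed` (a per-pixel encoding of the SAM mask with the same shape as the noise) is an assumed interface here, not the paper's exact encoding.

```python
import torch

def forward_diffusion_with_mask(x0, eps, mask_embed, alpha_bar_t):
    # Standard DDPM forward step, except the Gaussian noise is shifted
    # by a mask embedding, adapting the noise mean independently in each
    # segmented region (as described in the abstract).
    modulated_noise = eps + mask_embed                 # region-wise mean shift
    x_t = alpha_bar_t.sqrt() * x0 + (1.0 - alpha_bar_t).sqrt() * modulated_noise
    return x_t, modulated_noise  # the model is trained to predict modulated_noise
```

Because only the training-time noise target changes, the reverse diffusion process is untouched and SAM is not needed at inference, as the abstract emphasizes.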
10. Assortment Planning with Sponsored Products
- Author
-
Tang, Shaojie, Cai, Shuzhang, Yuan, Jing, and Han, Kai
- Subjects
Computer Science - Data Structures and Algorithms, Computer Science - Artificial Intelligence, Computer Science - Information Retrieval - Abstract
In the rapidly evolving landscape of retail, assortment planning plays a crucial role in determining the success of a business. With the rise of sponsored products and their increasing prominence in online marketplaces, retailers face new challenges in effectively managing their product assortment in the presence of sponsored products. Remarkably, previous research in assortment planning largely overlooks the existence of sponsored products and their potential impact on overall recommendation effectiveness, commonly making the simplifying assumption that all products are organic, i.e., non-sponsored. This research gap underscores the necessity for a more thorough investigation of the assortment planning challenge when sponsored products are in play. We formulate the assortment planning problem in the presence of sponsored products as a combinatorial optimization task. The ultimate objective is to compute an assortment plan that optimizes expected revenue while considering the specific requirements of placing sponsored products strategically.
- Published
- 2024
11. Data-efficient Large Vision Models through Sequential Autoregression
- Author
-
Guo, Jianyuan, Hao, Zhiwei, Wang, Chengcheng, Tang, Yehui, Wu, Han, Hu, Han, Han, Kai, and Xu, Chang
- Subjects
Computer Science - Computer Vision and Pattern Recognition - Abstract
Training general-purpose vision models on purely sequential visual data, eschewing linguistic inputs, has heralded a new frontier in visual understanding. These models are intended not only to comprehend but also to seamlessly transition to out-of-domain tasks. However, current endeavors are hamstrung by an over-reliance on colossal models, exemplified by models with upwards of 3B parameters, and the necessity for an extensive corpus of visual data, often comprising a staggering 400B tokens. In this paper, we delve into the development of an efficient, autoregression-based vision model, innovatively architected to operate on a limited dataset. We meticulously demonstrate how this model achieves proficiency in a spectrum of visual tasks spanning both high-level and low-level semantic understanding during the testing phase. Our empirical evaluations underscore the model's agility in adapting to various tasks, heralding a significant reduction in the parameter footprint, and a marked decrease in training data requirements, thereby paving the way for more sustainable and accessible advancements in the field of generalist vision models. The code is available at https://github.com/ggjy/DeLVM., Comment: 15 pages
- Published
- 2024
12. Vision Superalignment: Weak-to-Strong Generalization for Vision Foundation Models
- Author
-
Guo, Jianyuan, Chen, Hanting, Wang, Chengcheng, Han, Kai, Xu, Chang, and Wang, Yunhe
- Subjects
Computer Science - Computer Vision and Pattern Recognition - Abstract
Recent advancements in large language models have sparked interest in their extraordinary and near-superhuman capabilities, leading researchers to explore methods for evaluating and optimizing these abilities, which is called superalignment. In this context, our paper delves into the realm of vision foundation models, focusing on the concept of weak-to-strong generalization, which involves using a weaker model to supervise a stronger one, aiming to enhance the latter's capabilities beyond the former's limits. We introduce a novel and adaptively adjustable loss function for weak-to-strong supervision. Our comprehensive experiments span various scenarios, including few-shot learning, transfer learning, noisy label learning, and common knowledge distillation settings. The results are striking: our approach not only exceeds the performance benchmarks set by strong-to-strong generalization but also surpasses the outcomes of fine-tuning strong models with whole datasets. This compelling evidence underscores the significant potential of weak-to-strong generalization, showcasing its capability to substantially elevate the performance of vision foundation models. The code is available at https://github.com/ggjy/vision_weak_to_strong., Comment: 12 pages
- Published
- 2024
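The abstract does not spell out its adaptively adjustable loss, so the sketch below is only one plausible reading: a per-sample weighting that trusts the weak teacher's label where the teacher is confident and otherwise falls back to the student's own hard prediction (a self-bootstrapping term). The function name, threshold, and exact form are hypothetical.

```python
import torch
import torch.nn.functional as F

def weak_to_strong_loss(student_logits, weak_logits, conf_threshold=0.5):
    # Adaptive per-sample weighting between weak-teacher supervision and
    # student self-supervision. Illustrative only; not the paper's loss.
    weak_probs = F.softmax(weak_logits, dim=-1)
    teacher_conf, teacher_labels = weak_probs.max(dim=-1)
    student_labels = student_logits.argmax(dim=-1).detach()   # student's own hard labels
    ce_teacher = F.cross_entropy(student_logits, teacher_labels, reduction="none")
    ce_self = F.cross_entropy(student_logits, student_labels, reduction="none")
    w = (teacher_conf > conf_threshold).float()               # trust teacher where confident
    return (w * ce_teacher + (1.0 - w) * ce_self).mean()
```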
13. Rethinking Optimization and Architecture for Tiny Language Models
- Author
-
Tang, Yehui, Liu, Fangcheng, Ni, Yunsheng, Tian, Yuchuan, Bai, Zheyuan, Hu, Yi-Qi, Liu, Sichao, Jui, Shangling, Han, Kai, and Wang, Yunhe
- Subjects
Computer Science - Computation and Language, Computer Science - Artificial Intelligence, Computer Science - Machine Learning - Abstract
The power of large language models (LLMs) has been demonstrated through numerous data and computing resources. However, the application of language models on mobile devices faces huge challenges in computation and memory costs; that is, tiny language models with high performance are urgently required. Limited by the highly complex training process, many details of optimizing language models are seldom studied carefully. In this study, based on a tiny language model with 1B parameters, we carefully design a series of empirical studies to analyze the effect of each component. Three perspectives are mainly discussed, i.e., neural architecture, parameter initialization, and optimization strategy. Several design formulas are empirically proven especially effective for tiny language models, including tokenizer compression, architecture tweaking, parameter inheritance and multiple-round training. Then we train PanGu-$\pi$-1B Pro and PanGu-$\pi$-1.5B Pro on 1.6T multilingual corpora, following the established formulas. Experimental results demonstrate that the improved optimization and architecture yield a notable average improvement of 8.87 on benchmark evaluation sets for PanGu-$\pi$-1B Pro. Besides, PanGu-$\pi$-1.5B Pro surpasses a range of SOTA models with larger model sizes, validating its superior performance. The code is available at https://github.com/YuchuanTian/RethinkTinyLM.
- Published
- 2024
14. FROSTER: Frozen CLIP Is A Strong Teacher for Open-Vocabulary Action Recognition
- Author
-
Huang, Xiaohu, Zhou, Hao, Yao, Kun, and Han, Kai
- Subjects
Computer Science - Computer Vision and Pattern Recognition, Computer Science - Machine Learning - Abstract
In this paper, we introduce FROSTER, an effective framework for open-vocabulary action recognition. The CLIP model has achieved remarkable success in a range of image-based tasks, benefiting from its strong generalization capability stemming from pretraining on massive image-text pairs. However, applying CLIP directly to the open-vocabulary action recognition task is challenging due to the absence of temporal information in CLIP's pretraining. Further, fine-tuning CLIP on action recognition datasets may lead to overfitting and hinder its generalizability, resulting in unsatisfactory results when dealing with unseen actions. To address these issues, FROSTER employs a residual feature distillation approach to ensure that CLIP retains its generalization capability while effectively adapting to the action recognition task. Specifically, the residual feature distillation treats the frozen CLIP model as a teacher to maintain the generalizability exhibited by the original CLIP and supervises the feature learning for the extraction of video-specific features to bridge the gap between images and videos. Meanwhile, it uses a residual sub-network for feature distillation to reach a balance between the two distinct objectives of learning generalizable and video-specific features. We extensively evaluate FROSTER on open-vocabulary action recognition benchmarks under both base-to-novel and cross-dataset settings. FROSTER consistently achieves state-of-the-art performance on all datasets across the board. Project page: https://visual-ai.github.io/froster., Comment: Accepted by ICLR 2024
- Published
- 2024
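Residual feature distillation as described above can be sketched as a small residual sub-network whose output (the tuned feature plus a residual correction) is regressed onto the frozen CLIP teacher's feature. The layer sizes and cosine objective below are illustrative assumptions, not FROSTER's published configuration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ResidualDistillHead(nn.Module):
    # Small residual sub-network: the tuned model's feature plus a
    # residual projection is matched to the frozen CLIP teacher,
    # balancing video-specific learning against keeping CLIP's space.
    def __init__(self, dim=512, hidden=256):
        super().__init__()
        self.residual = nn.Sequential(
            nn.Linear(dim, hidden), nn.GELU(), nn.Linear(hidden, dim))

    def forward(self, student_feat):
        return student_feat + self.residual(student_feat)

def distill_loss(head, student_feat, frozen_clip_feat):
    # Cosine-similarity regression onto the frozen teacher feature.
    pred = head(student_feat)
    return 1 - F.cosine_similarity(pred, frozen_clip_feat, dim=-1).mean()
```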
15. A Survey on Transformer Compression
- Author
-
Tang, Yehui, Wang, Yunhe, Guo, Jianyuan, Tu, Zhijun, Han, Kai, Hu, Hailin, and Tao, Dacheng
- Subjects
Computer Science - Machine Learning, Computer Science - Computation and Language, Computer Science - Computer Vision and Pattern Recognition - Abstract
Transformer plays a vital role in the realms of natural language processing (NLP) and computer vision (CV), especially for constructing large language models (LLMs) and large vision models (LVMs). Model compression methods reduce the memory and computational cost of Transformer, which is a necessary step to implement large language/vision models on practical devices. Given the unique architecture of Transformer, featuring alternating attention and feedforward neural network (FFN) modules, specific compression techniques are usually required. The efficiency of these compression methods is also paramount, as retraining large models on the entire training dataset is usually impractical. This survey provides a comprehensive review of recent compression methods, with a specific focus on their application to Transformer-based models. The compression methods are primarily categorized into pruning, quantization, knowledge distillation, and efficient architecture design (Mamba, RetNet, RWKV, etc.). In each category, we discuss compression methods for both language and vision tasks, highlighting common underlying principles. Finally, we delve into the relation between various compression methods, and discuss further directions in this domain., Comment: Model Compression, Transformer, Large Language Model, Large Vision Model, LLM
- Published
- 2024
16. An Empirical Study of Scaling Law for OCR
- Author
-
Rang, Miao, Bi, Zhenni, Liu, Chuanjian, Wang, Yunhe, and Han, Kai
- Subjects
Computer Science - Computer Vision and Pattern Recognition - Abstract
The scaling laws relating model size, data volume, computation and model performance have been extensively studied in the field of Natural Language Processing (NLP). However, the scaling laws in Optical Character Recognition (OCR) have not yet been investigated. To address this, we conducted comprehensive studies examining the correlation between performance and the scale of models, data volume and computation in the field of text recognition. The study conclusively demonstrates smooth power laws between performance and model size, as well as training data volume, when other influencing factors are held constant. Additionally, we have constructed a large-scale dataset called REBU-Syn, which comprises 6 million real samples and 18 million synthetic samples. Based on our scaling law and new dataset, we have successfully trained a scene text recognition model, achieving a new state-of-the-art on 6 common test benchmarks with a top-1 average accuracy of 97.42%. The models and dataset are publicly available at https://github.com/large-ocr-model/large-ocr-model.github.io.
- Published
- 2023
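Fitting such a smooth power law between error and model size is a one-line regression in log-log space. The sketch below uses made-up placeholder numbers, not the paper's measurements:

```python
import numpy as np

# Fit a power law E = a * N^b between error E and model size N
# (the abstract reports such laws for model size and data volume).
N = np.array([1e6, 5e6, 2e7, 1e8])       # model sizes (parameters), placeholder data
E = np.array([0.20, 0.12, 0.08, 0.05])   # error rates, placeholder data

b, log_a = np.polyfit(np.log(N), np.log(E), 1)   # linear fit in log-log space
a = np.exp(log_a)
print(f"E ~= {a:.3g} * N^{b:.3f}")
```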
17. PanGu-$\pi$: Enhancing Language Model Architectures via Nonlinearity Compensation
- Author
-
Wang, Yunhe, Chen, Hanting, Tang, Yehui, Guo, Tianyu, Han, Kai, Nie, Ying, Wang, Xutao, Hu, Hailin, Bai, Zheyuan, Wang, Yun, Liu, Fangcheng, Liu, Zhicheng, Guo, Jianyuan, Zeng, Sinan, Zhang, Yinchen, Xu, Qinghua, Liu, Qun, Yao, Jun, Xu, Chao, and Tao, Dacheng
- Subjects
Computer Science - Computation and Language, Computer Science - Machine Learning - Abstract
The recent trend in large language models (LLMs) is to increase the scale of both model size (i.e., the number of parameters) and dataset to achieve better generative ability, as demonstrated by works such as GPT and LLaMA. However, large models often involve massive computational costs that practical applications cannot afford, and the method of constructing a strong model architecture for LLMs is rarely discussed. We first analyze the state-of-the-art language model architectures and observe the feature collapse problem. Based on the theoretical analysis, we propose that nonlinearity, which is usually studied in convolutional neural networks for vision tasks, is also very important for language models. The series informed activation function is then introduced with negligible extra computation, and an augmented shortcut is further used to enhance the model nonlinearity. We then demonstrate that the proposed approach is significantly effective for enhancing the model nonlinearity through carefully designed ablations; thus, we present a new efficient model architecture, namely PanGu-$\pi$. Experiments are then conducted using the same dataset and training strategy to compare PanGu-$\pi$ with state-of-the-art LLMs. The results show that PanGu-$\pi$-7B can achieve performance comparable to that of benchmarks with about 10\% inference speed-up, and PanGu-$\pi$-1B can achieve state-of-the-art performance in terms of accuracy and efficiency. In addition, we have deployed PanGu-$\pi$-7B in the high-value domains of finance and law, developing an LLM named YunShan for practical application. The results show that YunShan can surpass other models with similar scales on benchmarks.
- Published
- 2023
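A series-informed activation is commonly formulated as a weighted sum of shifted copies of a base nonlinearity, which adds nonlinearity at negligible extra cost. The sketch below follows that common formulation; the exact PanGu-$\pi$ variant may differ.

```python
import torch
import torch.nn as nn

class SeriesActivation(nn.Module):
    # Weighted sum of shifted copies of a base nonlinearity (ReLU here),
    # a common form of "series informed activation". The number of
    # terms, base function, and initialization are illustrative.
    def __init__(self, n_terms=3):
        super().__init__()
        self.scales = nn.Parameter(torch.ones(n_terms))
        self.shifts = nn.Parameter(torch.linspace(-1.0, 1.0, n_terms))

    def forward(self, x):
        return sum(a * torch.relu(x + b) for a, b in zip(self.scales, self.shifts))
```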
18. LightCLIP: Learning Multi-Level Interaction for Lightweight Vision-Language Models
- Author
-
Nie, Ying, He, Wei, Han, Kai, Tang, Yehui, Guo, Tianyu, Du, Fanyi, and Wang, Yunhe
- Subjects
Computer Science - Computer Vision and Pattern Recognition - Abstract
Vision-language pre-training like CLIP has shown promising performance on various downstream tasks such as zero-shot image classification and image-text retrieval. Most existing CLIP-like works usually adopt relatively large image encoders like ResNet50 and ViT, while the lightweight counterparts are rarely discussed. In this paper, we propose a multi-level interaction paradigm for training lightweight CLIP models. Firstly, to mitigate the problem that some image-text pairs are not in strict one-to-one correspondence, we improve the conventional global instance-level alignment objective by progressively softening the labels of negative samples. Secondly, a relaxed bipartite-matching-based token-level alignment objective is introduced for finer-grained alignment between image patches and textual words. Moreover, based on the observation that the accuracy of the CLIP model does not increase correspondingly as the parameters of the text encoder increase, an extra objective of masked language modeling (MLM) is leveraged to maximize the potential of the shortened text encoder. In practice, an auxiliary fusion module injecting unmasked image embeddings into masked text embeddings at different network stages is proposed to enhance the MLM. Extensive experiments show that, without introducing additional computational cost during inference, the proposed method achieves higher performance on multiple downstream tasks.
- Published
- 2023
19. Charting New Territories: Exploring the Geographic and Geospatial Capabilities of Multimodal LLMs
- Author
-
Roberts, Jonathan, Lüddecke, Timo, Sheikh, Rehan, Han, Kai, and Albanie, Samuel
- Subjects
Computer Science - Computer Vision and Pattern Recognition, Computer Science - Artificial Intelligence - Abstract
Multimodal large language models (MLLMs) have shown remarkable capabilities across a broad range of tasks but their knowledge and abilities in the geographic and geospatial domains are yet to be explored, despite potential wide-ranging benefits to navigation, environmental research, urban development, and disaster response. We conduct a series of experiments exploring various vision capabilities of MLLMs within these domains, particularly focusing on the frontier model GPT-4V, and benchmark its performance against open-source counterparts. Our methodology involves challenging these models with a small-scale geographic benchmark consisting of a suite of visual tasks, testing their abilities across a spectrum of complexity. The analysis uncovers not only where such models excel, including instances where they outperform humans, but also where they falter, providing a balanced view of their capabilities in the geographic domain. To enable the comparison and evaluation of future models, our benchmark will be publicly released., Comment: V3: Fixed typo in Fig.1; V2: Minor formatting changes and added missing subfigure captions
- Published
- 2023
20. One-for-All: Bridge the Gap Between Heterogeneous Architectures in Knowledge Distillation
- Author
-
Hao, Zhiwei, Guo, Jianyuan, Han, Kai, Tang, Yehui, Hu, Han, Wang, Yunhe, and Xu, Chang
- Subjects
Computer Science - Computer Vision and Pattern Recognition - Abstract
Knowledge distillation (KD) has proven to be a highly effective approach for enhancing model performance through a teacher-student training scheme. However, most existing distillation methods are designed under the assumption that the teacher and student models belong to the same model family, particularly the hint-based approaches. By using centered kernel alignment (CKA) to compare the learned features between heterogeneous teacher and student models, we observe significant feature divergence. This divergence illustrates the ineffectiveness of previous hint-based methods in cross-architecture distillation. To tackle the challenge of distilling heterogeneous models, we propose a simple yet effective one-for-all KD framework called OFA-KD, which significantly improves the distillation performance between heterogeneous architectures. Specifically, we project intermediate features into an aligned latent space such as the logits space, where architecture-specific information is discarded. Additionally, we introduce an adaptive target enhancement scheme to prevent the student from being disturbed by irrelevant information. Extensive experiments with various architectures, including CNN, Transformer, and MLP, demonstrate the superiority of our OFA-KD framework in enabling distillation between heterogeneous architectures. Specifically, when equipped with our OFA-KD, the student models achieve notable performance improvements, with a maximum gain of 8.0% on the CIFAR-100 dataset and 0.7% on the ImageNet-1K dataset. PyTorch code and checkpoints can be found at https://github.com/Hao840/OFAKD.
- Published
- 2023
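The core move in OFA-KD, projecting intermediate student features into the logits space (where architecture-specific structure is discarded) and supervising them with the teacher's logits, can be sketched as follows. The pooling (which assumes CNN-style (B, C, H, W) features), projection head, and temperature are simplified assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class LogitSpaceProjector(nn.Module):
    # Maps an intermediate student feature into the logits space so a
    # heterogeneous teacher's logits can supervise it. Pooling assumes
    # (B, C, H, W) features; details are simplified placeholders.
    def __init__(self, feat_dim, num_classes):
        super().__init__()
        self.head = nn.Linear(feat_dim, num_classes)

    def forward(self, feat):
        if feat.dim() > 2:
            feat = feat.flatten(2).mean(-1)   # global average pooling
        return self.head(feat)

def branch_kd_loss(student_branch_logits, teacher_logits, T=4.0):
    # Standard temperature-scaled KL between teacher and projected branch.
    p_t = F.softmax(teacher_logits / T, dim=-1)
    log_p_s = F.log_softmax(student_branch_logits / T, dim=-1)
    return F.kl_div(log_p_s, p_t, reduction="batchmean") * T * T
```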
21. SD4Match: Learning to Prompt Stable Diffusion Model for Semantic Matching
- Author
-
Li, Xinghui, Lu, Jingyi, Han, Kai, and Prisacariu, Victor
- Subjects
Computer Science - Computer Vision and Pattern Recognition, Computer Science - Machine Learning - Abstract
In this paper, we address the challenge of matching semantically similar keypoints across image pairs. Existing research indicates that the intermediate output of the UNet within Stable Diffusion (SD) can serve as robust image feature maps for such a matching task. We demonstrate that by employing a basic prompt tuning technique, the inherent potential of Stable Diffusion can be harnessed, resulting in a significant enhancement in accuracy over previous approaches. We further introduce a novel conditional prompting module that conditions the prompt on the local details of the input image pairs, leading to a further improvement in performance. We designate our approach as SD4Match, short for Stable Diffusion for Semantic Matching. Comprehensive evaluations of SD4Match on the PF-Pascal, PF-Willow, and SPair-71k datasets show that it sets new benchmarks in accuracy across all these datasets. Particularly, SD4Match outperforms the previous state-of-the-art by a margin of 12 percentage points on the challenging SPair-71k dataset., Comment: Accepted to CVPR 2024. Project website: https://sd4match.active.vision/
- Published
- 2023
22. Species196: A One-Million Semi-supervised Dataset for Fine-grained Species Recognition
- Author
-
He, Wei, Han, Kai, Nie, Ying, Wang, Chengcheng, and Wang, Yunhe
- Subjects
Computer Science - Computer Vision and Pattern Recognition, Computer Science - Artificial Intelligence - Abstract
The development of foundation vision models has pushed general visual recognition to a high level, but cannot well address fine-grained recognition in specialized domains such as invasive species classification. Identifying and managing invasive species has strong social and ecological value. Currently, most invasive species datasets are limited in scale and cover a narrow range of species, which restricts the development of deep-learning-based invasion biometrics systems. To fill this gap, we introduce Species196, a large-scale semi-supervised dataset of 196-category invasive species. It collects over 19K images with expert-level accurate annotations (Species196-L), and 1.2M unlabeled images of invasive species (Species196-U). The dataset provides four experimental settings for benchmarking existing models and algorithms, namely, supervised learning, semi-supervised learning, self-supervised pretraining and zero-shot inference ability of large multi-modal models. To facilitate future research on these four learning paradigms, we conduct an empirical study of the representative methods on the introduced dataset. The dataset is publicly available at https://species-dataset.github.io/., Comment: Accepted by NeurIPS 2023 Track Datasets and Benchmarks
- Published
- 2023
23. Gold-YOLO: Efficient Object Detector via Gather-and-Distribute Mechanism
- Author
-
Wang, Chengcheng, He, Wei, Nie, Ying, Guo, Jianyuan, Liu, Chuanjian, Han, Kai, and Wang, Yunhe
- Subjects
Computer Science - Computer Vision and Pattern Recognition, Computer Science - Artificial Intelligence - Abstract
In the past years, YOLO-series models have emerged as the leading approaches in the area of real-time object detection. Many studies have pushed the baseline to a higher level by modifying the architecture, augmenting data and designing new losses. However, we find previous models still suffer from an information fusion problem, although the Feature Pyramid Network (FPN) and Path Aggregation Network (PANet) have alleviated this. Therefore, this study provides an advanced Gather-and-Distribute (GD) mechanism, which is realized with convolution and self-attention operations. The newly designed model, named Gold-YOLO, boosts the multi-scale feature fusion capabilities and achieves an ideal balance between latency and accuracy across all model scales. Additionally, we implement MAE-style pretraining in the YOLO series for the first time, allowing YOLO-series models to benefit from unsupervised pretraining. Gold-YOLO-N attains an outstanding 39.9% AP on the COCO val2017 dataset and 1030 FPS on a T4 GPU, outperforming the previous SOTA model YOLOv6-3.0-N with similar FPS by +2.4%. The PyTorch code is available at https://github.com/huawei-noah/Efficient-Computing/tree/master/Detection/Gold-YOLO, and the MindSpore code is available at https://gitee.com/mindspore/models/tree/master/research/cv/Gold_YOLO., Comment: Accepted by NeurIPS 2023
- Published
- 2023
24. Boosting Semantic Segmentation from the Perspective of Explicit Class Embeddings
- Author
-
Liu, Yuhe, Liu, Chuanjian, Han, Kai, Tang, Quan, and Qin, Zengchang
- Subjects
Computer Science - Computer Vision and Pattern Recognition - Abstract
Semantic segmentation is a computer vision task that associates a label with each pixel in an image. Modern approaches tend to introduce class embeddings into semantic segmentation to deeply utilize category semantics, and regard supervised class masks as final predictions. In this paper, we explore the mechanism of class embeddings and find that more explicit and meaningful class embeddings can be purposely generated based on class masks. Following this observation, we propose ECENet, a new segmentation paradigm, in which class embeddings are obtained and enhanced explicitly during interaction with multi-stage image features. Based on this, we revisit the traditional decoding process and explore inverted information flow between segmentation masks and class embeddings. Furthermore, to ensure the discriminability and informativity of features from the backbone, we propose a Feature Reconstruction module, which combines intrinsic and diverse branches to ensure the concurrence of diversity and redundancy in features. Experiments show that our ECENet outperforms its counterparts on the ADE20K dataset with much less computational cost and achieves new state-of-the-art results on the PASCAL-Context dataset. The code will be released at https://gitee.com/mindspore/models and https://github.com/Carol-lyh/ECENet.
- Published
- 2023
25. Practical Parallel Algorithms for Non-Monotone Submodular Maximization
- Author
-
Cui, Shuang, Han, Kai, Tang, Jing, Huang, He, Li, Xueying, Zhiyuli, Aakas, and Li, Hanxiao
- Subjects
Computer Science - Data Structures and Algorithms, Computer Science - Machine Learning - Abstract
Submodular maximization has found extensive applications in various domains within the field of artificial intelligence, including but not limited to machine learning, computer vision, and natural language processing. With the increasing size of datasets in these domains, there is a pressing need to develop efficient and parallelizable algorithms for submodular maximization. One measure of the parallelizability of a submodular maximization algorithm is its adaptive complexity, which indicates the number of sequential rounds where a polynomial number of queries to the objective function can be executed in parallel. In this paper, we study the problem of non-monotone submodular maximization subject to a knapsack constraint, and propose the first combinatorial algorithm achieving an $(8+\epsilon)$-approximation under $\mathcal{O}(\log n)$ adaptive complexity, which is \textit{optimal} up to a factor of $\mathcal{O}(\log\log n)$. Moreover, we also propose the first algorithm with both provable approximation ratio and sublinear adaptive complexity for the problem of non-monotone submodular maximization subject to a $k$-system constraint. As a by-product, we show that our two algorithms can also be applied to the special case of submodular maximization subject to a cardinality constraint, and achieve performance bounds comparable with those of state-of-the-art algorithms. Finally, the effectiveness of our approach is demonstrated by extensive experiments on real-world applications., Comment: Part of the contribution appears in AAAI-2023
- Published
- 2023
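For contrast with the paper's low-adaptivity parallel algorithms, here is the fully sequential cost-benefit greedy baseline for submodular maximization under a knapsack constraint. It performs one query round per selected element (high adaptive complexity) and carries no approximation guarantee in the non-monotone setting, which is exactly the regime the paper's algorithms address; it is a baseline sketch, not the paper's method.

```python
def greedy_knapsack_submodular(f, ground_set, cost, budget):
    # Sequential cost-benefit greedy: repeatedly add the feasible element
    # with the best marginal gain per unit cost. `f` is a set function
    # evaluated as f(S); `cost` maps element -> cost.
    S, f_S = set(), f(set())
    remaining, spent = set(ground_set), 0.0
    while remaining:
        best, best_ratio = None, 0.0
        for e in remaining:
            if spent + cost[e] > budget:
                continue
            ratio = (f(S | {e}) - f_S) / cost[e]
            if ratio > best_ratio:
                best, best_ratio = e, ratio
        if best is None:          # no feasible element improves the objective
            break
        S.add(best)
        spent += cost[best]
        f_S = f(S)
        remaining.remove(best)
    return S
```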
26. Guide3D: Create 3D Avatars from Text and Image Guidance
- Author
-
Cao, Yukang, Cao, Yan-Pei, Han, Kai, Shan, Ying, and Wong, Kwan-Yee K.
- Subjects
Computer Science - Computer Vision and Pattern Recognition - Abstract
Recently, text-to-image generation has exhibited remarkable advancements, with the ability to produce visually impressive results. In contrast, text-to-3D generation has not yet reached a comparable level of quality. Existing methods primarily rely on text-guided score distillation sampling (SDS), and they encounter difficulties in transferring 2D attributes of the generated images to 3D content. In this work, we aim to develop an effective 3D generative model capable of synthesizing high-resolution textured meshes by leveraging both textual and image information. To this end, we introduce Guide3D, a zero-shot text-and-image-guided generative model for 3D avatar generation based on diffusion models. Our model involves (1) generating sparse-view images of a text-consistent character using diffusion models, and (2) jointly optimizing multi-resolution differentiable marching tetrahedral grids with pixel-aligned image features. We further propose a similarity-aware feature fusion strategy for efficiently integrating features from different views. Moreover, we introduce two novel training objectives as an alternative to calculating SDS, significantly enhancing the optimization process. We thoroughly evaluate the performance and components of our framework, which outperforms the current state-of-the-art in producing topologically and structurally correct geometry and high-resolution textures. Guide3D enables the direct transfer of 2D-generated images to the 3D space. Our code will be made publicly available., Comment: 25 pages, 22 figures
- Published
- 2023
27. Category Feature Transformer for Semantic Segmentation
- Author
-
Tang, Quan, Liu, Chuanjian, Liu, Fagui, Liu, Yifan, Jiang, Jun, Zhang, Bowen, Han, Kai, and Wang, Yunhe
- Subjects
Computer Science - Computer Vision and Pattern Recognition - Abstract
Aggregation of multi-stage features has been shown to play a significant role in semantic segmentation. Unlike previous methods employing point-wise summation or concatenation for feature aggregation, this study proposes the Category Feature Transformer (CFT) that explores the flow of category embedding and transformation among multi-stage features through the prevalent multi-head attention mechanism. CFT learns unified feature embeddings for individual semantic categories from high-level features during each aggregation process and dynamically broadcasts them to high-resolution features. Integrating the proposed CFT into a typical feature pyramid structure exhibits superior performance over a broad range of backbone networks. We conduct extensive experiments on popular semantic segmentation benchmarks. Specifically, the proposed CFT obtains a compelling 55.1% mIoU with greatly reduced model parameters and computations on the challenging ADE20K dataset.
- Published
- 2023
28. ParameterNet: Parameters Are All You Need
- Author
-
Han, Kai, Wang, Yunhe, Guo, Jianyuan, and Wu, Enhua
- Subjects
Computer Science - Computer Vision and Pattern Recognition - Abstract
Large-scale visual pretraining has significantly improved the performance of large vision models. However, we observe the \emph{low FLOPs pitfall} that the existing low-FLOPs models cannot benefit from large-scale pretraining. In this paper, we introduce a novel design principle, termed ParameterNet, aimed at augmenting the number of parameters in large-scale visual pretraining models while minimizing the increase in FLOPs. We leverage dynamic convolutions to incorporate additional parameters into the networks with only a marginal rise in FLOPs. The ParameterNet approach allows low-FLOPs networks to take advantage of large-scale visual pretraining. Furthermore, we extend the ParameterNet concept to the language domain to enhance inference results while preserving inference speed. Experiments on the large-scale ImageNet-22K have shown the superiority of our ParameterNet scheme. For example, ParameterNet-600M can achieve higher accuracy on ImageNet than the widely-used Swin Transformer (81.6\% \emph{vs.} 80.9\%) and has much lower FLOPs (0.6G \emph{vs.} 4.5G). In the language domain, LLaMA-1B enhanced with ParameterNet achieves 2\% higher accuracy over vanilla LLaMA. The code will be released at \url{https://parameternet.github.io/}., Comment: https://parameternet.github.io/
- Published
- 2023
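Dynamic convolution, the mechanism the abstract leverages, mixes K candidate kernels with input-dependent attention, so parameters grow roughly K-fold while per-input FLOPs stay close to a single convolution. A self-contained sketch of the standard formulation follows; the hyperparameters and routing head are illustrative, not ParameterNet's exact design.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class DynamicConv2d(nn.Module):
    # K candidate kernels combined per sample by a softmax router over
    # globally pooled features, then applied as one grouped convolution.
    def __init__(self, in_ch, out_ch, k=3, K=4):
        super().__init__()
        self.K, self.k, self.in_ch, self.out_ch = K, k, in_ch, out_ch
        self.weight = nn.Parameter(torch.randn(K, out_ch, in_ch, k, k) * 0.02)
        self.router = nn.Linear(in_ch, K)

    def forward(self, x):
        b = x.size(0)
        attn = F.softmax(self.router(x.mean(dim=(2, 3))), dim=-1)    # (b, K) mixing weights
        w = torch.einsum("bk,koihw->boihw", attn, self.weight)       # per-sample kernel
        w = w.reshape(b * self.out_ch, self.in_ch, self.k, self.k)
        x = x.reshape(1, b * self.in_ch, *x.shape[2:])
        out = F.conv2d(x, w, padding=self.k // 2, groups=b)          # batched grouped conv
        return out.reshape(b, self.out_ch, *out.shape[2:])
```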
29. HeadSculpt: Crafting 3D Head Avatars with Text
- Author
-
Han, Xiao, Cao, Yukang, Han, Kai, Zhu, Xiatian, Deng, Jiankang, Song, Yi-Zhe, Xiang, Tao, and Wong, Kwan-Yee K.
- Subjects
Computer Science - Computer Vision and Pattern Recognition - Abstract
Recently, text-guided 3D generative methods have made remarkable advancements in producing high-quality textures and geometry, capitalizing on the proliferation of large vision-language and image diffusion models. However, existing methods still struggle to create high-fidelity 3D head avatars in two aspects: (1) They rely mostly on a pre-trained text-to-image diffusion model whilst missing the necessary 3D awareness and head priors. This makes them prone to inconsistency and geometric distortions in the generated avatars. (2) They fall short in fine-grained editing. This is primarily due to the inherited limitations from the pre-trained 2D image diffusion models, which become more pronounced when it comes to 3D head avatars. In this work, we address these challenges by introducing a versatile coarse-to-fine pipeline dubbed HeadSculpt for crafting (i.e., generating and editing) 3D head avatars from textual prompts. Specifically, we first equip the diffusion model with 3D awareness by leveraging landmark-based control and a learned textual embedding representing the back view appearance of heads, enabling 3D-consistent head avatar generations. We further propose a novel identity-aware editing score distillation strategy to optimize a textured mesh with a high-resolution differentiable rendering technique. This enables identity preservation while following the editing instruction. We showcase HeadSculpt's superior fidelity and editing capabilities through comprehensive experiments and comparisons with existing methods., Comment: Webpage: https://brandonhan.uk/HeadSculpt/
- Published
- 2023
30. GPT4Image: Can Large Pre-trained Models Help Vision Models on Perception Tasks?
- Author
-
Ding, Ning, Tang, Yehui, Fu, Zhongqian, Xu, Chao, Han, Kai, and Wang, Yunhe
- Subjects
Computer Science - Computer Vision and Pattern Recognition - Abstract
The recent upsurge in pre-trained large models (e.g. GPT-4) has swept across the entire deep learning community. Such powerful large language models (LLMs) demonstrate advanced generative ability and multimodal understanding capability, which quickly achieve new state-of-the-art performances on a variety of benchmarks. The pre-trained LLM usually plays the role of a universal AI model that can conduct various tasks, including context reasoning, article analysis and image content comprehension. However, considering the prohibitively high memory and computational cost of implementing such a large model, conventional models (such as CNN and ViT) are still essential for many visual perception tasks. In this paper, we propose to enhance the representation ability of ordinary vision models for perception tasks (e.g. image classification) by taking advantage of large pre-trained models. We present a new learning paradigm in which the knowledge extracted from large pre-trained models is utilized to help models like CNN and ViT learn enhanced representations and achieve better performance. Firstly, we curate a high-quality description set by prompting a multimodal LLM to generate descriptive text for all training images. Furthermore, we feed these detailed descriptions into a pre-trained encoder to extract text embeddings with rich semantic information that encodes the content of images. During training, text embeddings serve as extra supervising signals and are aligned with image representations learned by vision models. The alignment process helps vision models learn better and achieve higher accuracy with the assistance of pre-trained LLMs. We conduct extensive experiments to verify that the proposed algorithm consistently improves the performance of various vision models with heterogeneous architectures., Comment: GitHub: https://github.com/huawei-noah/Efficient-Computing/tree/master/GPT4Image/
- Published
- 2023
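The alignment between image representations and description-text embeddings can be written as a standard symmetric contrastive loss added to the usual task loss. The InfoNCE form and temperature below are one common choice, not necessarily the paper's exact objective.

```python
import torch
import torch.nn.functional as F

def alignment_loss(image_feats, text_embeds, temperature=0.07):
    # Symmetric InfoNCE aligning each image representation with the text
    # embedding of its LLM-generated description (extra supervision on
    # top of the ordinary classification loss). Illustrative form only.
    img = F.normalize(image_feats, dim=-1)
    txt = F.normalize(text_embeds, dim=-1)
    logits = img @ txt.t() / temperature
    targets = torch.arange(img.size(0), device=img.device)
    return 0.5 * (F.cross_entropy(logits, targets) + F.cross_entropy(logits.t(), targets))
```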
31. ViCo: Plug-and-play Visual Condition for Personalized Text-to-image Generation
- Author
-
Hao, Shaozhe, Han, Kai, Zhao, Shihao, and Wong, Kwan-Yee K.
- Subjects
Computer Science - Computer Vision and Pattern Recognition, Computer Science - Artificial Intelligence - Abstract
Personalized text-to-image generation using diffusion models has recently emerged and garnered significant interest. This task encodes a novel concept (e.g., a unique toy), illustrated in a handful of images, into a generative model that captures fine visual details and generates photorealistic images based on textual embeddings. In this paper, we present ViCo, a novel lightweight plug-and-play method that seamlessly integrates visual condition into personalized text-to-image generation. ViCo stands out for its unique feature of not requiring any fine-tuning of the original diffusion model parameters, thereby facilitating more flexible and scalable model deployment. This key advantage distinguishes ViCo from most existing models that necessitate partial or full diffusion fine-tuning. ViCo incorporates an image attention module that conditions the diffusion process on patch-wise visual semantics, and an attention-based object mask that comes at no extra cost from the attention module. Despite only requiring light parameter training (~6% compared to the diffusion U-Net), ViCo delivers performance that is on par with, or even surpasses, all state-of-the-art models, both qualitatively and quantitatively. This underscores the efficacy of ViCo, making it a highly promising solution for personalized text-to-image generation without the need for diffusion model fine-tuning. Code: https://github.com/haoosz/ViCo, Comment: Under review
- Published
- 2023
32. Application analysis of extrusion and expanded pile in electric power engineering
- Author
-
Ma, Lixing, Han, Kai, Xue, Kai, and Li, Qian
- Published
- 2023
33. Study on Water Content and Water Saturation of Proton Exchange Membrane Fuel Cell Under Dynamic Conditions
- Author
-
Wang, Xuanyu, Han, Kai, Li, Xiaolong, Ke, Chang, and Lv, Bao
- Published
- 2023
34. Reprogramming Initiator and Nonsense Codons to Simultaneously Install Three Distinct Noncanonical Amino Acids into Proteins in E. coli
- Author
-
Jiang, Han-Kai and Tharp, Jeffery M.
- Published
- 2023
35. Study and Experimental Verification of the Effect of Assembly Pressure on the Electrical Efficiency of PEM Fuel Cells
- Author
-
Lv, Bao, Han, Kai, Li, Xiaolong, and Wang, Xuanyu
- Published
- 2023
36. A New Dy(III) Complex: Fluorescence Performances, Loading with Resveratrol-Hydrogels Against Skin Aging and Molecular Docking
- Author
-
Peng, Yu-Sheng, Han, Kai, Zhao, Jin-Xue, Song, Wei-Cheng, and Dai, Si-Qi
- Published
- 2024
37. CPSNet: a cyclic pyramid-based small lesion detection network
- Author
-
Zhu, Yan, Liu, Zhe, Song, Yuqing, Han, Kai, Qiu, Chengjian, Tang, YangYang, Zhang, Jiawen, and Liu, Yi
- Published
- 2024
38. Vinyl Chloride Suspension Copolymerization Combining Click with Sol-Gel Reactions for Sustainable Antifouling PVC Copolymer Ultrafiltration Membranes
- Author
-
Wang, Jianlong, Han, Kai, Zhao, Nana, Liu, Jian, Zhou, Chen, Yuan, Jinfeng, Pan, Zhicheng, and Pan, Mingwang
- Published
- 2024
39. A Robust Audio Deepfake Detection System via Multi-View Feature
- Author
-
Yang, Yujie, Qin, Haochen, Zhou, Hang, Wang, Chengcheng, Guo, Tianyu, Han, Kai, and Wang, Yunhe
- Published
- 2024
40. Theoretical Study of the Ternary Compound Monolayer CuP2Se for Photocatalytic Water Splitting with Efficient Optical Absorption
- Author
-
Qiu, Xiaole, Wang, Xiaoxuan, Liu, Xiaolu, Yuan, Saifei, Han, Kai, and Yang, Hongchao
- Published
- 2024
41. Self-Powered Agricultural Product Preservation and Wireless Monitoring Based on Dual-Functional Triboelectric Nanogenerator
- Author
-
Wang, Wenjing, Shang, Yurui, Han, Kai, Shi, Xue, Jiang, Tao, Mai, Wenjie, Luo, Jianjun, and Wang, Zhong Lin
- Published
- 2024
42. The communication and measurement architecture of BDS-3 global operations and services
- Author
-
Li, Gang, Guo, Shuren, Gong, Wenbin, Han, Kai, Gao, Weiguang, Shao, Fengwei, Wang, Wenbin, Tang, Chengpan, and Zhang, Feng
- Published
- 2024
43. Techno-economic assessment and mechanism discussion of a cogeneration shared energy storage system utilizing solid-state thermal storage: A case study in China
- Author
-
Ye, Zhaonian, Han, Kai, Wang, Yongzhen, Li, Chengyu, Zhao, Changlu, He, Jijiang, and Zhang, Lanlan
- Published
- 2024
44. Regulating hollow structure of CL-20 microspheres using microjet droplet technology to enhance safety and combustion performance
- Author
-
Liu, Yi, Shi, Jiahui, Guo, Yunyan, Xue, Zhihua, Han, Kai, Liu, Shujie, An, Chongwei, Ma, Zhongliang, and Wu, Bidong
- Published
- 2024
45. MMMViT: Multiscale multimodal vision transformer for brain tumor segmentation with missing modalities
- Author
-
Qiu, Chengjian, Song, Yuqing, Liu, Yi, Zhu, Yan, Han, Kai, Sheng, Victor S., and Liu, Zhe
- Published
- 2024
46. Imbalance multiclass problem: a robust feature enhancement-based framework for liver lesion classification
- Author
-
Hu, Rui, Song, Yuqing, Liu, Yi, Zhu, Yan, Feng, Nuo, Qiu, Chengjian, Han, Kai, Teng, Qiaoying, Haq, Imran Ul, and Liu, Zhe
- Published
- 2024
47. Spatially resolved single-cell atlas of ascidian endostyle provides insight into the origin of vertebrate pharyngeal organs
- Author
-
Jiang, An, Han, Kai, Wei, Jiankai, Su, Xiaoshan, Wang, Rui, Zhang, Wei, Liu, Xiawei, Qiao, Jinghan, Liu, Penghui, Liu, Qun, Zhang, Jin, Zhang, Nannan, Ge, Yonghang, Zhuang, Yuan, Yu, Haiyan, Wang, Shi, Chen, Kai, Lu, Wange, Xu, Xun, Yang, Huanming, Fan, Guangyi, and Dong, Bo
- Published
- 2024
48. Effects of train vibration load on the structure and hydraulic properties of soils
- Author
-
Han, Kai, Wang, Jiading, Xiao, Tao, Li, Shan, Zhang, Dengfei, and Dong, Haoyu
- Published
- 2024
49. Long-Term Outcomes of dMMR/MSI-H Rectal Cancer Treated With Anti–PD-1–Based Immunotherapy as Curative-Intent Treatment
- Author
-
Yu, Jie-Hai, Liao, Le-En, Xiao, Bin-Yi, Zhang, Xuan, Wu, Ai-Wen, Cheng, Yong, Tang, Jing-Hua, Jiang, Wu, Kong, Ling-Heng, Han, Kai, Mei, Wei-Jian, Hong, Zhi-Gang, Yang, Wan-Jun, Li, Dan-Dan, Pan, Zhi-Zhong, Li, Yun-Feng, Zhang, Xiao-Shi, and Ding, Pei-Rong
- Published
- 2024
50. Surfactant-Assisted Synthesis of Hybrid Copper(I) Halide Nanocrystals for X-ray Scintillation Imaging
- Author
-
Gu, Ranran, Han, Kai, Jin, Jiance, Zhang, Hao, and Xia, Zhiguo
- Published
- 2024